A Trie-Structured Bayesian Model for Unsupervised Morphological Segmentation
In this paper, we introduce a trie-structured Bayesian model for unsupervised
morphological segmentation. We adopt prior information from different sources
in the model. We use neural word embeddings to discover words that are
morphologically derived from each other and are therefore semantically
similar, and we use letter successor variety counts obtained from tries that
are built from neural word embeddings. Our results show that using different
information sources, such as neural word embeddings and letter successor
variety, as prior information improves morphological segmentation in a
Bayesian model. Our model outperforms other unsupervised morphological
segmentation models on Turkish and gives promising results on English and
German in scarce-resource settings.
Comment: 12 pages, accepted and presented at CICLing 2017 - 18th
International Conference on Intelligent Text Processing and Computational
Linguistics
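The letter successor variety (LSV) cue mentioned above can be made concrete with a small sketch. This is an illustrative assumption of how LSV counts are read off a trie, not the authors' implementation; the toy word list and the nested-dict trie layout are made up for the example.

```python
def build_trie(words):
    """Nested-dict trie: each node maps a letter to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-word marker counts as one continuation
    return root

def successor_variety(trie, prefix):
    """Number of distinct letters (or end-of-word) that can follow
    `prefix` in the trie. A spike in this count is a classic cue
    for a morpheme boundary."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return 0
        node = node[ch]
    return len(node)

# Toy corpus: after "walk" three continuations exist ($, s, e),
# while after "wal" only two letters (k, l) can follow.
words = ["walk", "walks", "walked", "walker", "wall"]
trie = build_trie(words)
```

A segmenter would hypothesize a boundary at positions where the variety count peaks relative to its neighbors, e.g. after "walk" in "walked".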
Unsupervised morphological segmentation using neural word embeddings
This is an accepted manuscript of an article published by Springer in Král P., Martín-Vide C. (eds) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol 9918 on 21/09/2016, available online: https://doi.org/10.1007/978-3-319-45925-7_4
The accepted version of the publication may differ from the final published version.

We present a fully unsupervised method for morphological segmentation. Unlike many morphological segmentation systems, our method is based on semantic features rather than orthographic features. In order to capture word meanings, word embeddings are obtained from a two-level neural network [11]. We compute the semantic similarity between words using the neural word embeddings, which forms our baseline segmentation model. We model morphotactics with a bigram language model based on maximum likelihood estimates, using the initial segmentations from the baseline. Results show that using semantic features helps to improve morphological segmentation, especially in agglutinating languages like Turkish. Our method shows competitive performance compared to other unsupervised morphological segmentation systems.

Published version
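The semantic baseline described above rests on comparing embedding vectors of a word and its candidate stem. A minimal sketch of that decision, assuming toy hand-written vectors and an arbitrary similarity threshold (real systems would use trained neural embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy embeddings (illustrative): "walking" is close to "walk",
# unrelated to "table".
emb = {
    "walking": [0.90, 0.10, 0.00],
    "walk":    [0.85, 0.15, 0.05],
    "table":   [0.00, 0.20, 0.95],
}

def plausible_split(word, stem, threshold=0.7):
    """Accept a stem+suffix split only if the stem stays semantically
    close to the full word; threshold is an assumed hyperparameter."""
    if stem not in emb or word not in emb:
        return False
    return cosine(emb[word], emb[stem]) >= threshold
```

Here `plausible_split("walking", "walk")` passes while `plausible_split("walking", "table")` does not, which is the intuition behind using semantic rather than orthographic evidence.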
On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions
Recent complementary strands of research have shown that leveraging
information on the data source through encoding their properties into
embeddings can lead to performance increase when training a single model on
heterogeneous data sources. However, it remains unclear in which situations
these dataset embeddings are most effective, because they are used in a large
variety of settings, languages and tasks. Furthermore, it is usually assumed
that gold information on the data source is available, and that the test data
is from a distribution seen during training. In this work, we compare the
effect of dataset embeddings in mono-lingual settings, multi-lingual settings,
and with predicted data source labels in a zero-shot setting. We evaluate on
three morphosyntactic tasks: morphological tagging, lemmatization, and
dependency parsing, and use 104 datasets, 66 languages, and two different
dataset grouping strategies. Performance increases are highest when the
datasets are of the same language and we know from which distribution the
test instance is drawn. In contrast, for setups where the data is from an
unseen distribution, the performance increase vanishes.
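The core mechanism in this line of work is simple: each data source gets a learned identifier vector that is concatenated onto the token representation, so one model can condition on the source. A sketch under assumed names and dimensions (a real system would train these vectors jointly with the model):

```python
import random

random.seed(0)
EMB_DIM = 4  # assumed size of the dataset embedding

# One trainable vector per dataset; here just randomly initialized.
# The dataset IDs are illustrative treebank-style names.
dataset_emb = {
    "ud_english_ewt":  [random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)],
    "ud_turkish_imst": [random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)],
}

def encode(token_vec, dataset_id):
    """Concatenate the data-source embedding onto the token vector,
    so downstream layers can adapt to source-specific properties."""
    return token_vec + dataset_emb[dataset_id]

tok = [0.5, -0.2, 0.1]
x = encode(tok, "ud_english_ewt")
```

In the zero-shot setting the abstract describes, `dataset_id` is not gold but predicted, and the paper's finding is that the benefit of this conditioning disappears for unseen distributions.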
Incorporating word embeddings in unsupervised morphological segmentation
This is an accepted manuscript of an article published by Cambridge University Press in Natural Language Engineering on 10/07/2020, available online: https://doi.org/10.1017/S1351324920000406
The accepted version of the publication may differ from the final published version.

© The Author(s), 2020. Published by Cambridge University Press.

We investigate the use of semantic information for morphological segmentation, since words that are derived from each other remain semantically related. We use mathematical models such as the maximum likelihood estimate (MLE) and the maximum a posteriori estimate (MAP), incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data, which makes it fully unsupervised; it requires only a small amount of raw data together with pretrained word embeddings for training. The results show that using dense vector representations helps morphological segmentation, especially for low-resource languages. We present results for Turkish, English, and German. Our semantic MLE model outperforms other unsupervised models for Turkish. Our proposed models can also be applied to any other low-resource language with concatenative morphology.

This research was supported by TUBITAK (The Scientific and Technological Research Council of Turkey) under grant number 115E464.

Published version
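The MLE side of the model above can be sketched as scoring candidate splits by the product of morpheme relative frequencies. The morpheme counts below are invented for illustration, and the semantic factor is omitted (the paper interpolates a similarity term like the embedding check shown earlier); this is not the authors' code.

```python
import math
from collections import Counter

# Pretend morpheme counts estimated from an earlier segmentation pass
# over raw text (illustrative numbers).
counts = Counter({"walk": 50, "ed": 40, "s": 60, "walke": 1, "d": 30})
total = sum(counts.values())

def mle_logprob(segments):
    """Sum of log maximum-likelihood morpheme probabilities;
    any unseen morpheme makes the whole analysis impossible (-inf)."""
    score = 0.0
    for s in segments:
        if counts[s] == 0:
            return float("-inf")
        score += math.log(counts[s] / total)
    return score

def best_split(word):
    """Choose the two-way split point with the highest MLE score."""
    candidates = [(word[:i], word[i:]) for i in range(1, len(word))]
    return max(candidates, key=lambda seg: mle_logprob(list(seg)))
```

Under these toy counts, `best_split("walked")` prefers `("walk", "ed")` over `("walke", "d")`, since both morphemes of the former are frequent.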
UDapter: Language Adaptation for Truly Universal Dependency Parsing
Recent advances in multilingual dependency parsing have brought the idea of a
truly universal parser closer to reality. However, cross-language interference
and restrained model capacity remain major obstacles. To address this, we
propose a novel multilingual task adaptation approach based on contextual
parameter generation and adapter modules. This approach makes it possible to
learn adapters via language embeddings while sharing model parameters across
languages. It also allows for an easy but effective integration of existing
linguistic typology features into the parsing network. The resulting parser,
UDapter, outperforms strong monolingual and multilingual baselines on the
majority of both high-resource and low-resource (zero-shot) languages, showing
the success of the proposed adaptation approach. Our in-depth analyses show
that soft parameter sharing via typological features is key to this success.
Comment: In EMNLP 2020
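The contextual parameter generation idea in the abstract can be sketched in a few lines: adapter weights are not stored per language but produced from a language embedding by a shared generator network. All shapes, the plain-list linear algebra, and the single linear generator below are illustrative assumptions, not UDapter's actual architecture.

```python
import random

random.seed(1)
LANG_DIM, HID = 3, 4  # assumed language-embedding and adapter sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

# Shared generator: maps a language embedding to HID*HID adapter weights.
# Only the generator is stored; per-language weights are derived from it.
generator = rand_matrix(LANG_DIM, HID * HID)

def generate_adapter(lang_emb):
    """Contextually generate a HID x HID adapter weight matrix from a
    language embedding via the shared linear generator."""
    flat = [sum(lang_emb[i] * generator[i][j] for i in range((LANG_DIM)))
            for j in range(HID * HID)]
    return [flat[r * HID:(r + 1) * HID] for r in range(HID)]

# Two (made-up) language embeddings yield two different adapters,
# while the generator parameters themselves are shared.
W_en = generate_adapter([0.2, -0.1, 0.4])
W_tr = generate_adapter([-0.3, 0.5, 0.1])
```

This is also where typological features enter in the paper: they can feed into the language embedding, so typologically similar languages receive similar generated adapters.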